Abstract. Identifying the risk factors for mental illnesses is of
significant public health importance. Diagnosis, stigma associated
with mental illnesses, comorbidity, and complex etiologies, among
others, make it very challenging to study mental disorders. Genetic
studies of mental illnesses date back at least a century, beginning
with descriptive studies based on Mendelian laws of inheritance.
A variety of study designs including twin studies, family studies,
linkage analysis, and more recently, genomewide association studies
have been employed to study the genetics of mental illnesses, or
complex diseases in general. In this paper, I will present the
challenges and methods from a statistical perspective and focus on
genetic association studies.

Key words and phrases: Comorbidity, covariate adjusted association 
test, FBAT, Kendall's tau, multiple traits, ordinal traits. 
1. INTRODUCTION

Mental illnesses affect the health and well-being of all
populations and all ages. Schizophrenia, a chronic, severe, and
disabling brain disorder, is one of these mental illnesses,
affecting about 1.1 percent of the U.S. population age 18 and
older in a given year. People with schizophrenia sometimes hear
voices others do not hear, believe that others are broadcasting
their thoughts to the world, or become convinced that others are
plotting to harm them. These experiences can make them fearful
and withdrawn and cause difficulties when they try to have
relationships with others (http://www.nimh.nih.gov).
Emil Kraepelin (1856-1926) described "Dementia Praecox" as an
inherited disorder in his influential "Textbook of Psychiatry"
(1899). The term Dementia Praecox, later renamed "schizophrenia,"
was first used by Arnold Pick (1851-1924), a professor of
psychiatry at the German branch of Charles University in Prague,
to describe a patient with a psychotic disorder resembling
hebephrenia in 1891.

Heping Zhang is Professor, Yale School of Public
Health, Yale University, 60 College Street, New Haven,
Connecticut 06520-8034, USA e-mail:
heping.zhang@yale.edu.

This is an electronic reprint of the original article
published by the Institute of Mathematical Statistics in
Statistical Science, 2011, Vol. 26, No. 1, 116-129. This
reprint differs from the original in pagination and
typographic detail.

Nearly a century ago, Cannon and Rosanoff (1911) made an attempt
to understand whether there are any forms of nervous and mental
diseases that are transmitted from generation to generation in
concordance with Mendelian laws. They examined the families of 11
neuropathic patients, who are now referred to as probands in
pedigrees. Using Mendelian laws as their theoretical expectation,
they concluded that the neuropathic make-up is recessive to normal.
Although the report was indeed "preliminary," a few things are
noteworthy. First, they noted that "any form of insanity or even
all the forms of hereditary insanity do not constitute an
independent hereditary character." This was an early sign of the
complexity of studying mental disorders with regard to the
characterization of the disorders and their comorbidity. Here,
comorbidity refers to the presence of more than one disease
condition in the same patient. Second, they remarked "should larger
accumulations of such data in the future give similar results, we
shall be able" to confirm their result. The requirement for more
samples and replication is another challenge in studies of complex
diseases. Last, but not least, while they said "let us test, ...,
the hypothesis ..." they did not mean a statistical test. However,
the idea of the $\chi^2$-test is evident.

Despite this early work, it was not until the 1960s
that researchers began to use scientifically rigorous
designs and methods to study the inheritance
of mental illnesses. For example, the key idea in 
adoption studies lies in the belief that any links be- 
tween an adopted child and the biological parents 
are attributable to genetics, and any links between 
that child and adoptive parents can be attributed 
to environment (Plomin et al., 1997). This enables 
us to separate the confounding environment (i.e., 
a family) from genetic contribution. Consequently, 
there are two strategies in adoption studies. One approach
compares the risk of developing schizophrenia in the adopted
children of schizophrenic parents to the risk in adopted children
whose parents do not have schizophrenia. Several studies
including Heston
(1966), Rosenthal (1972) and Tienari (1991) used 
this approach to study schizophrenia. Each study 
found an elevated risk in adopted-away children of 
schizophrenic parents, supporting the role of genet- 
ics in the transmission of schizophrenia. The origin 
of this approach is the schizophrenic parents. An- 
other approach backtracks from adopted children 
who have developed schizophrenia and compares the 
risks of schizophrenia in their adoptive and biologi- 
cal families. Kety, Rosenthal and Wender (1978) and 
others found that the risk was significantly higher in 
the biological relatives than in the adoptive families, 
again underscoring the role of genetics as a risk fac- 
tor. 

While these schizophrenia adoption studies are
influential in understanding the role of genetics in
mental disorders, most of the evidence for genetic factors
in mental disorders comes from family
and twin studies. By comparing the concordance in
the risk between identical (monozygotic) and frater- 
nal (dizygotic) twins, twin studies arguably provide 
the most compelling results about genetic and envi- 
ronmental effects. For example, the concordance in 
monozygotic twins for Tourette's syndrome, a com- 
plex disorder characterized by repetitive, sudden and 
involuntary movements or noises called tics, was re- 
ported to be about 50% whereas it is less than 10% 
in dizygotic twins. 

Twin studies are most helpful in demonstrating the magnitude of
genetic effect, but they do not provide insight into the
inheritance pattern of a condition. Family studies can offer
information that twin studies cannot. For example, Cannon and
Rosanoff (1911) employed a small-scale, simple family study.
Using the Mendelian laws, we might not only find evidence of
genetics, but also infer the mode of transmission, as Cannon and
Rosanoff (1911) concluded for the heredity of insanity.

Although twin and family studies continue to be 
useful for understanding the genetics of complex dis- 
eases, different studies are needed to locate a specific 
gene on a chromosome that may underlie the dis- 
ease. Gene mapping in humans through linkage anal- 
ysis emerged in the 1930s, but it was Morton (1955) 
who laid the foundation for the methodology. It was 
only during the 1970s and 1980s, when the Elston-Stewart
(1971) algorithm was developed and implemented
(Ott, 1974), that the method thrived as
a common tool of genetic studies. These initial and 
subsequent developments allowed for linkage analy- 
ses of multiple markers simultaneously. In light of 
the sheer number of genes and that we do not know 
which specific gene we are looking for, we typically 
genotype 300 to 400 "landmarks" that cover the 22 
pairs of autosomes and the X chromosome. By infer- 
ring the transmission patterns of these markers, then 
linking them to the disease status, we can obtain in- 
formation about the most probable region where the 
gene of interest resides. 

While linkage studies have had some successes (e.g., BRCA1), they
have generated much more premature excitement. In the late 1980s,
two particular studies attracted significant public attention
after they reported that bipolar affective disorders were linked
to DNA markers on chromosome 11, and that a susceptibility locus
for schizophrenia was located on chromosome 5. Unfortunately,
these findings were not replicated. Replications in genetic
studies of mental disorders do not come easily. For example,
Abelson et al. (2005) identified mutations involving the SLITRK1
gene (13q31.1) in a small number of people with Tourette's
syndrome. However, most people with Tourette's syndrome do not
have a mutation in the SLITRK1 gene. Because the mutations were
reported in so few people with this condition, the association of
the SLITRK1 gene with this disorder could not be confirmed. In
fact, Scharf et al. (2008) reported a lack of association between
SLITRK1 var321 and Tourette's syndrome in a large family-based
sample.

Various reasons have been suggested to explain the difficulties
hindering progress in genetic studies using linkage analysis.
A key concept underlying linkage analysis is the recombination
fraction, which reflects the distance between any two loci, such
as a DNA marker and the disease locus. There may be limited
information in the data, however, diminishing the power of the
linkage study. Furthermore, complex diseases are polygenic,
involving multiple genes (Carter and Chung, 1980). Linkage
analyses, however, generally assume one major gene. Additionally,
heterogeneity in the diagnosis and comorbidity of mental
illnesses make linkage analysis considerably more difficult, if
possible at all.

Many investigators have adopted association analyses to take
advantage of the advent of high-throughput genotyping
technologies. Recent efforts have identified genes that
contribute to a number of complex human traits using ultra-dense
genetic markers (Arking et al., 2006; Klein et al., 2005; Duerr
et al., 2006; Chen et al., 2007). Trios (one affected offspring
and two parents) have been an effective design for association
studies, particularly with the development of the elegant
transmission/disequilibrium test (TDT) (Spielman, McGinnis and
Ewens, 1993). The central idea of this test is that each affected
child serves as his or her own matched case and control. This
acts to control for all potential confounding issues and examines
alleles that both are and are not transmitted from the parents.
In the absence of association between the affection status and
the gene, the distributions of the transmitted and
non-transmitted alleles are expected to be the same. Deviations
in distribution as evaluated by a $\chi^2$-test indicate the
existence of association. Trios are the simplest example of a
nuclear family, but when other siblings are available, the trio
design is not cost-effective. As a result, family-based
association tests (FBAT) including sibships (Spielman and Ewens,
1998; Horvath and Laird, 1998; Knapp, 1999), nuclear families
(Weinberg, 1999; Lunetta et al., 2000; Rabinowitz and Laird,
2000) and general pedigrees (Martin, Monks, Warren and Kaplan,
2000) have been developed.
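
The comparison of transmitted and non-transmitted alleles can be
sketched numerically. The block below is a minimal illustration,
assuming the usual McNemar-type form of the TDT statistic computed
from heterozygous parents only; the function names and the counts
(60 and 40) are hypothetical and are not taken from any study cited
here.

```python
import math


def tdt_statistic(b, c):
    """McNemar-type TDT statistic (assumed form, for illustration).

    b: heterozygous parents transmitting allele A (and not a)
    c: heterozygous parents transmitting allele a (and not A)
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    return (b - c) ** 2 / (b + c)


def chi2_1_pvalue(x):
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical counts: A transmitted 60 times, a transmitted 40 times.
stat = tdt_statistic(b=60, c=40)
p_value = chi2_1_pvalue(stat)
print(stat, p_value)
```

Because only within-family transmissions enter the statistic, the
comparison does not depend on how allele frequencies differ between
families, which is the sense in which each child serves as his or
her own control.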

Another restriction in the use of trios is the requirement of
defining the affection status of a disease. Consequently,
association tests have been proposed for quantitative traits
(Allison, 1997; Rabinowitz, 1997), traits with distribution
belonging to an exponential family (Liu, Tritchler and Bull,
2002), ordinal traits (Zhang, Wang and Ye, 2006; Wang, Ye and
Zhang, 2006) and multiple traits (Lange et al., 2003; Zhang, Liu
and Wang, 2010).



Since the early success in identifying the comple- 
ment factor H polymorphism in age-related macu- 
lar degeneration (Klein et al., 2005), case-control 
association studies have intensified, and many ge- 
netic variants have been identified and catalogued 
(Hindorff et al., 2009). Despite the enormous in- 
vestment, the intense attention to the genetics of 
diseases, the rapid improvement in technology, and 
the increasingly large sample sizes in many studies, 
it remains challenging to identify disease genes, es- 
pecially those underlying mental illnesses. Some of 
the common genetic variants that have been iden- 
tified for complex diseases only account for a small 
portion of the genetic risk, which may vary across 
populations (Goldstein, 2009). For example, Kopp 
et al. (2008) and Kao et al. (2008) identified sev- 
eral variations in the MYH9 gene as major contrib- 
utors to excess risk of kidney disease among African- 
Americans. They found that 60 percent of African- 
Americans carry the risk variants as opposed to 4 
percent of white Americans. 

Technology will continue to improve and the amount 
of genetic data will increase. The purpose of this ar- 
ticle is to review some of the progress from a statis- 
tical perspective and discuss some of the potential 
challenges. Obviously, it would take volumes or se- 
ries to do justice to all of the work in statistical 
genetics. Instead of taking on that impossible task, 
this article is oriented toward the publications di- 
rectly related to my own recent work. 

2. METHODS 

Since 1952, the American Psychiatric Association 
has published four editions of the Diagnostic and 
Statistical Manual of Mental Disorders (DSM) and 
plans to release its fifth edition in 2013. While widely 
used, the use and development of the DSM has not 
gone without controversy and criticism. Unlike diseases for which
the diagnoses are well accepted by physicians and patients, such
as cancer, the diagnosis of mental disorders must reflect
biological factors (e.g., gender and racial disparities) and
non-biological factors, such as culture, that are not specific to
one person, and it must also reflect the natural variation
within the same person.

2.1 Ordinal Traits 

It is clear from the above discussion that a simple dichotomous
diagnosis (e.g., yes or no), or a well-distributed continuous
trait, is unlikely to characterize the state of mental disorders.
In fact, the questions used in the diagnosis of mental disorders,
such as those in DSM-IV, are usually posed in terms of severity
or frequency, and hence on an ordinal scale.

Statistical methods for genetic analysis are well es- 
tablished for both quantitative (continuous) and bi- 
nary traits (see, e.g., Blackwelder and Elston, 1985; 
Goldgar, 1990; Schork, 1993; Amos, 1994; Risch and 
Zhang, 1995; Kruglyak et al., 1996; Blangero and Almasy, 1997;
Ott, 1999). While there has been some progress in the analysis of
ordinal traits (e.g., Heath et al., 2002; Steinke, Borish and
Rosenwasser, 2003; Vergne et al., 2003; Zhang, Feng and Zhu,
2003; Feng, Leckman and Zhang, 2004; Zhang, Liu and Wang, 2010),
especially in plant science (Rao and Xu, 1998; Xu and Xu, 2006),
insufficient attention has been paid to addressing the unique
challenges of analyzing ordinal traits. Some researchers have
recognized that it is difficult to conduct genetic analyses of
ordinal traits because such traits cannot be directly
characterized by a linear function of genetic
and environmental effects (Rao and Xu, 1998). To
fill in this methodological gap, we have made a sys- 
tematic effort to develop statistical methods for seg- 
regation analysis (Zhang, Feng and Zhu, 2003), link- 
age analysis (Feng, Leckman and Zhang, 2004) and 
association analysis (Zhang, Wang and Ye, 2006) of 
ordinal traits (for family studies and case-control 
studies). 

2.1.1 Analysis of family data. Long before the era of genomics,
researchers collected data in families, also called pedigrees, as
illustrated in Figure 1. Although the ascertainment process for
families varies, Figure 1 depicts a representative
three-generation pedigree. The proband is the first person who
enters into the study according to defined inclusion and
exclusion criteria; such criteria are related to the disease of
interest. Other members of the proband's family are included and
directly or indirectly assessed, depending on the circumstance.
The key idea in analyzing family data is that if a gene is a
major driving force behind a disease, the pattern of concordance
of the disease among family members would reflect the
transmission pattern of the gene under the Mendelian laws. This
is the fundamental concept that Cannon and Rosanoff (1911)
employed. This type of analysis is referred to as segregation
analysis.

Fig. 1. A three-generation pedigree. [Diagram not recoverable;
surviving node labels: Spouse, Son.]

The Elston-Stewart (1971) algorithm set up the quintessential
framework to analyze data from general pedigrees through a
technique called peeling. The main complication in analyzing
pedigree data is the complex relationship among family members,
making it difficult to express the likelihood function in an
easily computed form. The peeling algorithm makes use of the
conditional independence embedded in the pedigree resulting from
the Mendelian laws, and so peels off the complete likelihood
function into smaller pieces before putting them back together.
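
The conditional independence that peeling exploits can be seen in
the smallest case, a single nuclear family. The sketch below is
not the Elston-Stewart implementation; it assumes one biallelic
locus, Hardy-Weinberg genotype frequencies for the two founders, a
binary phenotype, and hypothetical penetrance values. Given the
parents' genotypes, the children are independent, so the sum over
each child's genotype can be taken inside the product.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of copies of allele A


def founder_prior(g, q):
    # Hardy-Weinberg genotype probabilities (an assumption of this sketch).
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)[g]


def child_prob(gc, gf, gm):
    # Mendelian transmission: each parent passes A with probability g / 2.
    tf, tm = gf / 2.0, gm / 2.0
    return ((1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm)[gc]


def pheno_prob(y, g, penetrance):
    # penetrance[g] = P(affected | genotype g); y is 1 (affected) or 0.
    return penetrance[g] if y == 1 else 1 - penetrance[g]


def nuclear_family_likelihood(father, mother, children, q, penetrance):
    """Likelihood of observed 0/1 phenotypes in one nuclear family."""
    total = 0.0
    for gf, gm in product(GENOTYPES, repeat=2):
        term = (founder_prior(gf, q) * pheno_prob(father, gf, penetrance)
                * founder_prior(gm, q) * pheno_prob(mother, gm, penetrance))
        for y in children:
            # Each child is "peeled" separately given the parental genotypes.
            term *= sum(child_prob(gc, gf, gm) * pheno_prob(y, gc, penetrance)
                        for gc in GENOTYPES)
        total += term
    return total


# Hypothetical inputs: allele frequency 0.1, penetrances 0.01 / 0.5 / 0.9.
lik = nuclear_family_likelihood(0, 1, (1, 0, 1), 0.1, (0.01, 0.5, 0.9))
print(lik)
```

With n children, this evaluates 9 parental genotype pairs times 3n
child terms instead of summing over all 3^(n+2) joint genotype
configurations; the same idea, applied repeatedly across nuclear
families within a pedigree, is what makes larger pedigrees
tractable.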

Other methods have been relatively recently de- 
veloped using the concept of latent random variables 
(Hopper, 1989; Babiker and Cuzick, 1994; Li and 
Thompson, 1997; Siegmund and McKnight, 1998; 
Zhang and Merikangas, 2000), which are closely re- 
lated to the classic ousiotype models of Cannings, 
Thompson and Skolnick (1978) in pedigree analysis. 
The basic idea is to use latent variables to repre- 
sent the contribution of unobserved factors includ- 
ing a major gene, residual genetic factors and com- 
mon environmental factors. As discussed by Zhang 
and Merikangas (2000), the computation involving 
pedigrees is similar to the peeling algorithm. Ad- 
vantages of using latent variable based models are 
that the interactions between underlying genetic ef- 
fects and the observed covariates (e.g., demographic 
variables) can be considered. Additionally, more rel- 
evant to this article, we can accommodate ordinal 
traits in the latent variable framework. 

2.1.2 A latent variable model. We follow the notation of Zhang
and Merikangas (2000) and Zhang, Feng and Zhu (2003). Consider a
trait, $Y$, that takes an ordinal value of $0, 1, \ldots, K$. Let
$x$ be a $p$-vector of covariates that is also available for each
study subject. Three types of latent random variables $U_1^i$,
$U_2^i$ and $U_3^i$ are introduced within family $i$ to
represent, respectively, (a) common, unmeasured environmental
factors; (b) genetic susceptibility of the family founders (a
founder refers to a subject whose parents are not a part of the
observed pedigree, e.g., father, mother and spouse in Figure 1);
and (c) the transmission of susceptibility genes from a parent to
an offspring.

The concept of latent variables is straightforward, 
but the interesting and difficult part lies in the spec- 
ification of their distributions. They need to be in- 
terpretable and convenient. The following are the 
assumptions that we found useful: 

• $U_1^i$ follows a Bernoulli distribution with
  $P\{U_1^i = 1\} = 1 - P\{U_1^i = 0\} = \theta_1$, where
  $\theta_1$ is an unknown parameter.

• $U_2^i = (U_{2,1}^i, U_{2,2}^i, \ldots, U_{2,2n_i-1}^i, U_{2,2n_i}^i)'$,
  where $n_i$ is the size of pedigree $i$. Here,
  $P\{U_{2,2j-1}^i = 1\} = 1 - P\{U_{2,2j-1}^i = 0\} = \theta_2$
  when $U_{2,2j-1}^i$ and $U_{2,2j}^i$ are the $U_2$-variables of
  a founder.

• $U_3^i = (U_{3,1}^i, \ldots, U_{3,s_i}^i)'$. According to the
  Mendelian laws,
  $P\{U_{3,j}^i = 1\} = P\{U_{3,j}^i = 0\} = \frac{1}{2}$,
  $j = 1, \ldots, s_i$, and $s_i$ is the number of parent-offspring
  pairs in family $i$. $U_3^i$ facilitates the transmission of
  $U_2$-variables from the founders to the offspring. For example,
  if a parent of subject $j$ has $U_2$-variables $U_{2,2k-1}^i$
  and $U_{2,2k}^i$, and the $U_3$-variable for this
  parent-offspring pair is $U_{3,l}^i$, then one of subject $j$'s
  $U_2$-variables is
  $$U_{2,2k-1}^i U_{3,l}^i + U_{2,2k}^i (1 - U_{3,l}^i).$$

• All latent variables are independent.

$U_1^i$ is a simple "switch" indicating the presence or absence
of a shared environment factor within family $i$. $U_2^i$ is
assigned independently to each of the founders, who are the
source for any gene to enter into a family, and thus mimics the
transmission of a single major susceptibility locus with alleles
$A$ and $a$ of frequencies $\theta_2$ and $1 - \theta_2$,
respectively.
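
A small simulation may make the roles of the three latent
variables concrete. The sketch below draws them for a single
father-mother-child trio under the distributional assumptions
listed above; the function name and the parameter values passed to
it are hypothetical.

```python
import random


def simulate_trio_latents(theta1, theta2, rng):
    """Draw U1, the founders' U2 pairs, two U3 switches, and the child's U2 pair."""
    u1 = int(rng.random() < theta1)  # shared-environment switch, Bernoulli(theta1)

    # Each founder carries two independent Bernoulli(theta2) U2-variables.
    father = [int(rng.random() < theta2) for _ in range(2)]
    mother = [int(rng.random() < theta2) for _ in range(2)]

    # One U3-variable per parent-offspring pair, Bernoulli(1/2).
    u3_father = int(rng.random() < 0.5)
    u3_mother = int(rng.random() < 0.5)

    # Transmission rule: U2[first] * U3 + U2[second] * (1 - U3).
    child = [
        father[0] * u3_father + father[1] * (1 - u3_father),
        mother[0] * u3_mother + mother[1] * (1 - u3_mother),
    ]
    return u1, father, mother, child


rng = random.Random(2011)  # fixed seed so the sketch is reproducible
print(simulate_trio_latents(theta1=0.3, theta2=0.2, rng=rng))
```

The child's two $U_2$-variables are always copies of one paternal
and one maternal variable, which is how the model mimics Mendelian
transmission at a single locus.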

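Conditional on all of the latent variables, denoted by $U^i$,
within family $i$, the probability distribution for member $j$ is
assumed to be

$$P\{Y_j^i \le k \mid U^i\} = \frac{\exp(x_j^{i\prime}\beta + \alpha_k + a_j^{i\prime}\gamma)}{1 + \exp(x_j^{i\prime}\beta + \alpha_k + a_j^{i\prime}\gamma)}, \qquad k = 0, \ldots, K-1, \tag{2.1}$$

where $a_j^i = (U_1^i,\; U_{2,2j-1}^i + U_{2,2j}^i,\; U_{2,2j-1}^i U_{2,2j}^i)'$,
and $\beta$ and $\gamma$ are $p$- and 3-vectors of parameters. The
$\alpha_k$ is the trait level dependent intercept,
$k = 0, \ldots, K-1$.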
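
To see how model (2.1) turns a linear predictor into probabilities
for each ordinal level, the following sketch evaluates the
cumulative logit and differences adjacent cumulative
probabilities. It assumes (2.1) is read as a model for
$P\{Y \le k\}$ with increasing intercepts; the function name and
all numeric inputs are hypothetical.

```python
import math


def category_probs(x, beta, alpha, a, gamma):
    """Return [P(Y = 0), ..., P(Y = K)] for one subject.

    x, beta : covariates and their coefficients (same length)
    alpha   : increasing intercepts alpha_0 < ... < alpha_{K-1}
    a, gamma: latent-variable vector and its coefficients (length 3)
    """
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    eta += sum(ai * gi for ai, gi in zip(a, gamma))
    # exp(z) / (1 + exp(z)) written as 1 / (1 + exp(-z)).
    cumulative = [1.0 / (1.0 + math.exp(-(eta + ak))) for ak in alpha]
    cumulative.append(1.0)  # P(Y <= K) = 1
    probs = [cumulative[0]]
    probs += [cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))]
    return probs


# Hypothetical subject: no covariate effect, U1 = 1, one copy of the "A" variable.
print(category_probs(x=[0.4], beta=[0.0], alpha=[-1.0, 1.0],
                     a=(1, 1, 0), gamma=(1.0, 0.5, 0.0)))
```

The returned probabilities sum to one by construction, and they
are nonnegative only when the intercepts are ordered, which is why
the ordering of the $\alpha_k$ is stated as an assumption.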

As Zhang, Feng and Zhu (2003) pointed out, the $\beta$ parameters
measure the strength of association between the trait and the
covariates, conditional on the latent variables. The $\gamma$
parameters indicate the familial and genetic contributions to the
trait. The mode of inheritance can be inferred from $\gamma$. For
example, $\gamma_2 = 0$ and $\gamma_3 \neq 0$ suggests a recessive
effect.

The likelihood function can be derived from (2.1). Due to the
presence of latent variables, the EM algorithm (Dempster, Laird
and Rubin, 1977) is the most convenient choice for parameter
estimation (Guo and Thompson, 1992; Zhang and Merikangas, 2000;
Zhang, Feng and Zhu, 2003). Although Zhang and Merikangas (2000)
and Zhang, Feng and Zhu (2003) presented an effective solution
(e.g., a modified likelihood), we should note that the lack of
concavity in the likelihood function makes it a challenging task
to find the maximum likelihood estimates of the model parameters.
In addition, the $\theta$'s and $\gamma$'s are not fully
identifiable. The identifiability issue not only causes
computational problems, but also presents theoretical challenges
in statistical inference. Another important, yet understudied,
issue is the validation of the assumptions on the distributions
of the latent variables.

But, how useful is the latent variable model (2.1)? 
First, it provides a regression framework to assess 
familial aggregation and genetic contribution, and 
possibly interactions between measured covariates 
and latent factors. Using data from a family study of 
substance use (Merikangas et al., 1998), Zhang and 
Merikangas (2000) were able to present extremely 
significant evidence of familial aggregation p-value 
<10~^ for alcohol dependence. This study addition- 
ally demonstrated that transmission does not follow 
a major locus pattern. In retrospect, their findings 
predicted the difficulty of identifying major genes 
associated with alcoholism. In addition, Zhang and 
Merikangas (2000) presented simulation examples 
to delineate when the absence of latent variables 
in (2.1) affects the estimates of the effects by the 
measured covariates. For example, hypothetically, if 
the greater presence of females in a family has an 
impact on the well-being of the family, ignoring the 
familial latent variables is likely to result in a biased 
estimate of the sex difference. 

Not only is it important to include the latent factors, but it is
also important to adjust for covariates. To further illustrate
this point, Zhang, Feng and Zhu (2003) reported the following
simulation. Ten thousand data sets were generated from model
(2.1) with $\theta_1 = 0.3$, $\beta$ chosen from 0, 1, 5 or 10,
$\gamma_1$ from 0, 1 or 2, $\alpha_0 = -1$ and $\alpha_1 = 1$. To
focus on the difference of having or not having covariates, they
set $\gamma_2 = \gamma_3 = 0$. Each data set consists of 200
families with 7 family members (similar to Figure 1).

Table 1
The probability estimates of rejecting $\gamma_1 = 0$ at the
significance level of 0.05. The covariate is omitted from the
testing despite the fact that its coefficient $\beta$ may not be zero

                $\gamma_1 = 0$    $\gamma_1 = 1$    $\gamma_1 = 2$
$\beta = 0$         0.0494            0.9503            1.0
$\beta = 1$         0.0534            0.9843            1.0
$\beta = 5$         0.1667            0.9971            1.0
$\beta = 10$        0.3828            0.9890            1.0

One covariate $x$ was generated as follows. For family $i$,
$U_1^i$ was generated according to whether a random number
$r_{i1}$ from the uniform(0, 1) distribution was greater than 0.3
or not. For member $j$ in family $i$, an independent random
number $r_{ij2}$ from the uniform(0, 1) distribution was
generated. Then, $x_{ij} = 0.9 r_{ij2} + 0.2 r_{i1}$.
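
The covariate construction just described can be sketched in a few
lines. This is an illustration only: the function name is
hypothetical, and the direction of the threshold (taking the
latent switch to be 1 when the family-level draw falls below 0.3,
so that it equals 1 with probability 0.3) is an assumption, since
the text says only "greater than 0.3 or not".

```python
import random


def simulate_family_covariates(n_members, rng, theta1=0.3):
    """Generate the latent switch U1 and the covariate x for one family."""
    r_i1 = rng.random()  # family-level uniform(0, 1) draw, shared by all members

    # ASSUMPTION: U1 = 1 when r_i1 < theta1, giving P(U1 = 1) = theta1.
    u1 = int(r_i1 < theta1)

    # x_ij = 0.9 * r_ij2 + 0.2 * r_i1, with r_ij2 drawn independently per member.
    x = [0.9 * rng.random() + 0.2 * r_i1 for _ in range(n_members)]
    return u1, x


rng = random.Random(0)  # fixed seed so the sketch is reproducible
u1, x = simulate_family_covariates(n_members=7, rng=rng)
print(u1, x)
```

Because the same draw $r_{i1}$ enters both the latent switch and
every member's covariate, $x$ is correlated with the familial
latent variable, which is presumably what allows an omitted
covariate to distort the test of the familial effect.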

To evaluate the performance of the test statistic, the covariate
was deliberately ignored in the test. When $\beta = 0$, the
covariate played no role in the data generating process. The row
corresponding to $\beta = 0$ in Table 1 displays the empirical
type I error rate (the column corresponding to $\gamma_1 = 0$)
and the power for two values of $\gamma_1$ (1 or 2).

When $\beta \neq 0$, the covariate plays a role in the data
generating model. The data in Table 1 reveal the consequence of
ignoring the covariate, which is more severe when the effect of
the covariate is greater: the rejection rate under
$\gamma_1 = 0$ rises from 0.0494 at $\beta = 0$ to 0.3828 at
$\beta = 10$.

2.1.3 Linkage analysis. While linkage analysis has a long
history, it only became common practice after the availability of
several convenient computing programs (Ott, 1974; Kruglyak et
al., 1996; Almasy and Blangero, 1998). For statisticians, some of
the common terminologies in linkage analysis are puzzling,
including the so-called LOD-score method and nonparametric
method.

Morton (1955) first introduced the term "LOD-score." LOD stands
for "the logarithm (base 10) of odds." The "odds" is a
probability ratio, or likelihood ratio, of the probability under
an alternative hypothesis to the probability under the null
hypothesis. The LOD-score method is essentially a log-likelihood
ratio test with two fundamental differences: (a) the LOD-score
uses the base 10 logarithm rather than the natural logarithm;
(b) the log-likelihood ratio statistic has a multiplier of 2 so
that it conforms to a $\chi^2$ distribution under certain
regularity conditions.

Specifically, the LOD-score is the log(base 10)-ratio of the
likelihood when the recombination fraction is less than 1/2
(i.e., the two loci are linked) to the likelihood when the
recombination fraction is 1/2 (i.e., the two loci are unlinked,
as when they are not on the same chromosome). The recombination
fraction is the frequency with which a chromosomal crossover
occurs between two loci (or genes) during meiosis; a
recombination frequency of 1% is termed a distance of one
centimorgan (cM) in a genetic linkage map. Because the LOD-score
is in base 10, a score of 3 indicates 1000 to 1 odds in favor of
linkage, which is the conventional threshold for declaring
evidence for linkage. If we convert a LOD-score of 3 into the
standard log-likelihood ratio statistic,
$2 \ln(10) \times 3 \approx 13.8$, it yields a p-value of about
$2 \times 10^{-4}$ under $\chi^2_1$. By Bonferroni correction, it
corresponds to a genomewide p-value of 0.05 for 250 markers. This
number is in the range for the number of microsatellites used in
typical linkage studies.
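
The arithmetic behind the LOD-score threshold can be checked
directly. The sketch below assumes the simplest textbook setting
of phase-known, fully informative meioses, in which the likelihood
is binomial in the recombination fraction, and it assumes a
chi-square reference distribution with one degree of freedom for
the conversion to a p-value. The function names are hypothetical.

```python
import math


def lod_phase_known(recombinants, meioses):
    """LOD score maximized over theta in [0, 1/2] for phase-known meioses."""
    r, n = recombinants, meioses
    theta_hat = min(r / n, 0.5)  # maximum likelihood estimate, capped at 1/2

    def log10_lik(theta):
        # log10 of theta^r * (1 - theta)^(n - r), skipping 0 * log(0) terms.
        out = 0.0
        if r > 0:
            out += r * math.log10(theta)
        if n - r > 0:
            out += (n - r) * math.log10(1.0 - theta)
        return out

    return log10_lik(theta_hat) - n * math.log10(0.5)


def lod_to_pvalue(lod):
    """Convert a LOD score to a chi-square (1 df) p-value."""
    lrt = 2.0 * math.log(10.0) * lod  # natural-log likelihood ratio statistic
    return math.erfc(math.sqrt(lrt / 2.0))


print(lod_phase_known(recombinants=0, meioses=10))  # ten non-recombinant meioses
print(lod_to_pvalue(3.0), 0.05 / 250)               # compare with Bonferroni level
```

Ten consecutive non-recombinant meioses give a LOD-score just
above 3, and the p-value for a LOD-score of 3 is close to
0.05/250, in line with the Bonferroni argument in the text.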

In order to compute the LOD-score, we first need a number of
parameters that determine the likelihood for a given
recombination fraction. We then use the likelihood maximized over
the recombination fraction as the likelihood under the
alternative hypothesis. The parameters that are required include
the mode of inheritance, penetrance, and disease allele
frequency. These parameters are generally unknown and difficult
to estimate for complex diseases including mental illnesses. For
example, Pauls and Leckman (1986) examined specific genetic
hypotheses about the mode of transmission of Gilles de la
Tourette's syndrome by performing segregation analyses (see
Section 2.1.1) in 30 nuclear families (two-generation pedigrees).
They concluded that Tourette's syndrome is inherited as an
autosomal dominant trait (one copy of the abnormal allele is
sufficient to cause the disease). The penetrance (the probability
of having the disease for a given genotype) was reported at 0.71
in males and 1.0 in females with at least one abnormal allele.
After several decades of research, no major genetic variant has
been identified for Tourette's syndrome, and most likely this
syndrome involves multiple genes, interacting with environmental
factors. This reality makes it difficult to infer the mode of
inheritance, penetrance, and disease allele frequency, and
conceptually, these parameters may not make sense for complex
diseases (non-Mendelian inheritance).

This difficulty is somewhat alleviated since the LOD-score method
has been found to work reasonably well (e.g., Abreu, Greenberg
and Hodge, 1999) under various parameter settings. There have
been some efforts to improve the robustness of the method
(Gastwirth, 1966, 1985; Whittemore, 1996). See Zheng et al.
(2009) for a thorough review. Existing methods do not extend to
the case of ordinal traits. The effectiveness of the robust
methods remains to be studied. Naturally, nonparametric linkage
methods have been developed to avoid specification of the genetic
model parameters. In statistics, "nonparametric" methods
typically refer to distribution-free methods such as rank-based
tests and methods based on the empirical distribution. In linkage
analysis, however, "nonparametric" does not mean
"distribution-free," but instead refers to the replacement of
true genetic model parameters with the parameters of inheritance
of markers, hypothesized to be close to the disease locus. Thus,
with nonparametric linkage methods, we still need to compute the
likelihood. Two core algorithms are used to compute the
likelihood: the Elston-Stewart (1971) algorithm and the
Lander-Green (1987) algorithm. As previously discussed, the
Elston-Stewart (1971) algorithm is a peeling algorithm that makes
the computation in a large pedigree feasible by splitting it into
small pieces. This algorithm was implemented in early versions of
linkage analysis programs (e.g., LIPED and LINKAGE); its
computational time increases linearly with family size, but
exponentially with the number of loci. More recent programs
(e.g., GENEHUNTER) use the Lander-Green (1987) algorithm, whose
complexity is linear in the number of loci but, unfortunately,
exponential in the family size. Although Markov chain Monte Carlo
methods have been used to accommodate linkage analysis of large
families and a large number of markers (Guo and Thompson, 1992),
in practice, one may have to break large pedigrees apart in order
to run programs such as GENEHUNTER.

We should note that there had not been a linkage 
analysis program to handle ordinal traits until the 
release of LOT (Zhang et al., 2008). Typically, the 
methods for linkage analysis can be divided into two 
main steps; only the second step involves the trait 
(Kruglyak et al., 1996). The first step infers how ge- 
netic information travels in a family as represented 
by the so-called "inheritance vector." 

We will use the pedigree in Figure 1 to illustrate this concept.
The two parents and spouse are the founders of the family,
meaning that their parents are not in the current pedigree. The
four siblings and the child are nonfounders. The inheritance
pattern at marker locus $t$ is completely described by an
inheritance vector
$v(t) = (v_1, v_2, v_3, v_4, \ldots, v_9, v_{10})'$. In other
words, we devote two elements to every nonfounder. The founders
are not included because they are the sources of the genes in the
family and the inheritance vector is conditional on their genes.
The paired elements describe the outcomes of the paternal and
maternal meioses transmitted to the nonfounders. Specifically,
$v_{2j-1} = 1$ or 2 according to whether the grandpaternal or
grandmaternal allele is transmitted in the paternal meiosis to
the $j$th nonfounder. $v_{2j}$ carries similar information for
the corresponding maternal meiosis, namely, $v_{2j} = 3$ or 4
according to whether the grandpaternal or grandmaternal allele
was transmitted in the maternal meiosis to the $j$th nonfounder.

In practice, the genetic markers do not always al- 
low us to determine the true inheritance vector. In 
this case, the inheritance distribution is the conditional
probability distribution over the possible inheritance
vectors that conform with the alleles observed
at $t$, which we denote by $P\{v(t) = w\}$ for all
inheritance vectors $w \in \mathcal{V}$; here $\mathcal{V}$ is the set of all
possible inheritance vectors. In the absence of any 
genotypic information, all inheritance vectors are 
equally likely according to Mendel's first law; the 
probability distribution is uniform. 
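
To make the inheritance vector concrete, the following Python sketch enumerates every inheritance vector for a pedigree with a given number of nonfounders, using the coding described above (1 or 2 for the paternal meiosis, 3 or 4 for the maternal meiosis), and assigns the uniform distribution that applies when no genotype information is available. The function names are hypothetical and the sketch is only an illustration, not the implementation used by any linkage program.

```python
from itertools import product


def enumerate_inheritance_vectors(n_nonfounders):
    """List all inheritance vectors under the coding used in the text.

    For the j-th nonfounder, element 2j-1 (paternal meiosis) is 1 or 2
    and element 2j (maternal meiosis) is 3 or 4.
    """
    choices = [(1, 2), (3, 4)] * n_nonfounders
    return list(product(*choices))


def uniform_prior(vectors):
    """Uniform distribution over the vectors (no genotype information)."""
    p = 1.0 / len(vectors)
    return {v: p for v in vectors}


# Pedigree of Figure 1: four siblings and one child are nonfounders.
vectors = enumerate_inheritance_vectors(5)
prior = uniform_prior(vectors)
print(len(vectors))        # number of possible inheritance vectors
print(prior[vectors[0]])   # prior probability of any single vector
```

With five nonfounders there are $2^{10}$ vectors, which illustrates why exact computation over all inheritance vectors grows exponentially with family size.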

For segregation analysis, we employed latent vari- 
ables to reflect the "imaginative" genetic effects 
in (2.1). In linkage analysis, we have genetic markers 
that flow through the inheritance vector. Thus, we 
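can still use (2.1) for linkage analysis except that the
latent genetic term for the $j$th nonfounder should be a sum of two
components indexed by the inheritance-vector elements $v_{2j-1}$ and
$v_{2j}$. On one hand, we
have a reduced number of latent variables. On the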
other hand, many of the latent variables depend 
on each other through the inheritance vectors. The 
computation of the likelihood would be summed over 
all inheritance vectors w in V, in addition to the 
probability space of the remaining independent la- 
tent variables. Because of this connection and dis- 
tinction, the challenges in the linkage analysis of or- 
dinal traits are, to a great extent, similar to those 
in segregation analysis of ordinal traits, for example, 
the asymptotic mixture of $\chi^2$-distributions and the
need to introduce the penalized likelihood (Liang 
and Rathouz, 1999; Zhang, Feng and Zhu, 2003). 

2.1.4 Association test. As discussed above, linkage
analysis focuses on testing the position of a marker, 
although it has been difficult to replicate findings in 
linkage studies of mental disorders. An association 
analysis, however, tests whether a genetic variant,
such as a particular allele or genotype of a marker
or a haplotype across several markers, is associated with
a trait. Some study cohorts recruited for linkage studies
have been re-genotyped for genomewide association
analyses. For binary or quantitative traits, many
methods have been developed and implemented. Two 
commonly used programs are PLINK (Purcell et al., 
2007) and FBAT (Rabinowitz and Laird, 2000). To 
analyze an ordinal trait, Zhang, Wang and Ye (2006) 
introduced the following proportional odds model: 

(2.2) $\operatorname{logit}\{P(y_{ij} \le k \mid c_{ij})\} = \alpha_k + \beta c_{ij},$

where $\alpha_0, \ldots, \alpha_{K-1}$ are nondecreasing level parameters
and $\beta$ is the genetic effect. The genetic factor $c_{ij}$
can be chosen to reflect the underlying mode of inheritance,
such as the number of copies of the risk allele. Under
model (2.2), the null hypothesis is $H_0\colon \beta = 0$.
The score statistic is 

(2.3) $S = \sum_{i}\sum_{j} [R^{+}(y_{ij}) - R^{-}(y_{ij})] A_{ij},$

where $R^{+}(y_{ij})$ and $R^{-}(y_{ij})$ are the counts of offspring
in the entire sample whose trait values are
greater or less than $y_{ij}$, respectively, and $A_{ij}$ is the
number of copies of transmitted alleles at the marker
locus. Thus, Zhang, Wang and Ye (2006) proposed
the following O-TDT test based on the score statistic:

$$\frac{[S - E(S \mid Y)]^2}{\operatorname{Var}(S \mid Y)},$$

which follows a $\chi^2_1$-distribution asymptotically. For
a case-control study, $R^{+}(y_{ij})$ and $R^{-}(y_{ij})$ are the
numbers of subjects whose trait values are greater
or less than $y_{ij}$, respectively.

If we rewrite the statistic in (2.3) in a general form
as $\sum_{i}\sum_{j} w_{ij} A_{ij}$, this yields the classic TDT when
$w_{ij} = 1$ and the QTDT (Rabinowitz, 1997) when
$w_{ij} = y_{ij} - \bar{y}$, where $\bar{y}$ is the average of all $y_{ij}$'s. In
other words, all of these tests are a weighted function 
of the number of transmitted alleles at the marker 
locus, and the choice of the weights depends on the 
property of the trait. With this observation, after 
the proper weights are computed, the existing FBAT 
software can be used to test the association between 
any trait and alleles at a marker locus. 
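
As an illustration of this weighted form, the following Python sketch computes the three weight choices mentioned above (constant weights for the TDT, centered trait values for the QTDT, and the count difference of (2.3) for the O-TDT) and the resulting weighted sum of transmitted allele counts. The function names and the toy data are hypothetical, offspring are stored in one flat list rather than indexed by family and member, and the sketch stops at the raw statistic: it does not compute the conditional mean and variance needed for the test itself.

```python
def tdt_weights(y):
    """Constant weights, as in the classic TDT."""
    return [1.0 for _ in y]


def qtdt_weights(y):
    """Trait values centered at the sample mean, as in the QTDT."""
    ybar = sum(y) / len(y)
    return [yi - ybar for yi in y]


def otdt_weights(y):
    """R+(y) - R-(y): offspring with larger minus smaller trait values."""
    return [
        sum(1 for t in y if t > yi) - sum(1 for t in y if t < yi)
        for yi in y
    ]


def weighted_statistic(weights, transmitted):
    """Weighted sum of the transmitted allele counts."""
    return sum(w * a for w, a in zip(weights, transmitted))


# Toy data: ordinal trait levels and transmitted allele counts.
y = [0, 1, 1, 2, 2, 2]
transmitted = [0, 1, 1, 1, 2, 2]

w = otdt_weights(y)
print(w)
print(weighted_statistic(w, transmitted))
```

Only the weights change from one test to another, which is why software that accepts user-supplied weights can serve all three cases.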

In the following, we describe a unified method to 
choose weights for any kind of trait. It is straight- 
forward to categorize a quantitative trait into any 
reasonable number of categories (such as deciles) 
and induce an ordinal scaled trait. This would allow
the use of the O-TDT for a quantitative trait.
In their simulation studies, Zhang, Wang and Ye
(2006) demonstrated that this strategy has comparable
power to the QTDT for quantitative traits.
This is due to the fact that the number of categories 
is enough to capture most of the information in the 
data (e.g., following Cochran's rule; Cochran, 1977). 
The advantage is that the ordinal scaled test is not 
affected by the nonnormal distribution of a quanti- 
tative trait, and so, the unified approach is robust. 

One limitation of the test proposed by Zhang, 
Wang and Ye (2006) is that it does not adjust for co- 
variates. Environmental factors or covariates, such 
as gender and age, may confound the association of 
interest. In a subsequent work, Wang, Ye and Zhang 
(2006) generalized model (2.2) to include covariates 
as follows: 

(2.4) $\operatorname{logit}\{P(y_{ij} \le k \mid c_{ij}, z_{ij})\} = \alpha_k + \beta c_{ij} + \delta' z_{ij},$

where $z_{ij}$ denotes the covariates and $\delta$ is the vector
of the corresponding coefficients. Consequently, the
score statistic becomes 

(2.5) $S = \sum_{i}\sum_{j} [1 - \gamma(y_{ij}, z_{ij}) - \gamma(y_{ij} - 1, z_{ij})] A_{ij},$
where

$$\gamma(k, z_{ij}) = \frac{\exp(\alpha_k + \delta' z_{ij})}{1 + \exp(\alpha_k + \delta' z_{ij})},$$

which is the estimated probability of having a trait
value no greater than $k$. Thus, the weight function
in (2.5) is the difference between the probability of
having a trait value greater than $y_{ij}$ and the probability
of having a trait value less than $y_{ij}$. Not surprisingly,
this is in essence the same as the weight
function in (2.3) where we used counts instead of 
frequency (or probability). 
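
The correspondence between the probability-based and count-based weights can be checked numerically. The Python sketch below assumes that the trait levels are coded 0, 1, ..., K, that the cumulative probability is 0 below level 0 and 1 at level K, and that there are no covariates, so that the level parameters are simply the logits of the empirical cumulative frequencies. The function names and data are hypothetical. Under these assumptions the weight "probability of a larger value minus probability of a smaller value" equals the count difference of (2.3) divided by the sample size.

```python
from math import exp, log


def cumulative_prob(k, z, alpha, delta):
    """P(trait <= k | z) under the null model (beta = 0).

    Assumes levels 0..K with len(alpha) == K, so the probability
    is 0 below level 0 and 1 at the top level K.
    """
    if k < 0:
        return 0.0
    if k >= len(alpha):
        return 1.0
    eta = alpha[k] + sum(d * zi for d, zi in zip(delta, z))
    return exp(eta) / (1.0 + exp(eta))


def adjusted_weight(y, z, alpha, delta):
    """P(trait > y | z) - P(trait < y | z)."""
    upper = 1.0 - cumulative_prob(y, z, alpha, delta)
    lower = cumulative_prob(y - 1, z, alpha, delta)
    return upper - lower


def count_weight(y, sample):
    """R+(y) - R-(y) computed from the sample."""
    greater = sum(1 for t in sample if t > y)
    less = sum(1 for t in sample if t < y)
    return greater - less


def logit(p):
    return log(p / (1.0 - p))


sample = [0, 1, 1, 2, 2, 2]
n = len(sample)
# Without covariates, alpha_k is the logit of the empirical P(trait <= k).
alpha = [logit(sum(1 for t in sample if t <= k) / n) for k in (0, 1)]

for y in sorted(set(sample)):
    print(y, adjusted_weight(y, (), alpha, ()), count_weight(y, sample) / n)
```

When covariates are present, the same function applies with nonempty `z` and `delta`; the weights then differ between subjects who share a trait value but not a covariate profile.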

It is important to note that association analysis 
does not directly equate to a causal relationship. 
In well-designed genetic association studies, an ob- 
served association is expected to result from either 
a causal functional variant of a gene, or the linkage 
disequilibrium between the marker and a suscepti- 
bility gene. In population-based case-control stud- 
ies, there are typically attempts to match cases and 
controls by important demographic and/or baseline 
information. It is not wise to over-match subjects. 
Alternatively, we can collect potentially important 
environmental variables and consider them in the 
association analysis. We can also use principal com- 
ponent analysis on the genotypes to explore whether 
there are "clusters" in the study cohorts that are not 
appropriately reflected in the environmental variables.
In family-based studies, the association test tends
to be conditional on parental genotypes and all phenotypes.

2.1.5 Unique challenges in analyzing ordinal traits 
Understanding the genetic mechanisms for complex 
diseases is challenging regardless of whether we an- 
alyze binary, ordinal or continuous traits. Any chal- 
lenges that exist for analyzing binary and continu- 
ous traits remain for ordinal traits. What are the 
unique challenges in analyzing ordinal traits? The 
key difference is that there is not a simple distri- 
bution function for ordinal traits. For continuous 
traits, the assumption is that the traits can some- 
how be treated under normality, by transformation 
if needed. For binary traits, through a link function 
(e.g., logit) we only need to deal with a Bernoulli dis- 
tribution. However, for ordinal traits, the two typical 
approaches are (a) to assume a liability variable or
a continuous latent variable or (b) to assume a pro- 
portional odds model as we presented above. The 
first challenge is in the estimation. The likelihood
function is complicated and, based on numerical
results, has multiple local maxima. In addition,
due to non-identifiability (or near non-identifiability),
the likelihood function may be relatively flat. Combinations
of the EM and other algorithms can provide
practical solutions, but finding a more efficient
algorithm is an open problem.

The second challenge is in the inference. When 
latent variables or mixture distributions are used, 
some of the commonly assumed regularity condi- 
tions do not hold. One solution is to use a penalized 
likelihood function (Zhang, Feng and Zhu, 2003) 
that prevents the parameters from being near the 
singularity points. 

Finally, model diagnostics are difficult. For ex- 
ample, how do we know the latent variable-based 
model or the proportional odds model provides an 
adequate fit to the data? Although the models and 
methods presented above do not address this and 
other questions, they provide a foundation for fur- 
ther research and improvement. 

2.2 Comorbidity 

The methods described above only deal with a sing- 
le trait. However, comorbidity is the rule rather than 
the exception in studies of mental and behavioral 
disorders. For example, a patient may suffer from 
both anxiety and depression (Li and Burmeister,
2009), and the same patient may also be addicted
to nicotine, alcohol, or other substances (Merikangas 
et al., 1998; True et al., 1999). From a data analysis 
perspective, we need to consider how important it is 
to accommodate multiple diseases/traits. In a real- 
data example, Chen et al. (2011) analyzed a data set 
from the Study of Addiction: Genetics and Environ- 
ment (SAGE). By simply considering addiction to 
at least two of the six substances (addiction to nico- 
tine, alcohol, marijuana, cocaine, opiates or other 
drugs), we were able to identify the PKNOX2 gene
that reached genomewide significance level among
European-origin females. Interestingly, the PKNOX2
gene has been previously identified as one of the cis- 
regulated genes for alcohol addiction in mice (Mul- 
ligan et al., 2006). To further delineate the benefit 
of considering multivariate traits, Zhu and Zhang 
(2009) conducted comprehensive simulation studies, 
considered correlations of 0.2 and $-0.2$ among
three quantitative traits, and demonstrated that testing
correlated traits jointly is more powerful than
testing a single trait at a time. Using generalized
estimating equations, Lange et al. (2003) developed
a family-based association test for multivariate quan- 
titative traits (FBAT-GEE). Recently, Zhang, Liu 
and Wang (2010) constructed a nonparametric test 
based on the generalized Kendall's tau to accommo- 
date any combination of dichotomous, ordinal, and 
quantitative traits. 

2.2.1 Kendall's tau. Kendall's $\tau$ is a rank-based
correlation between two variables. It contrasts the
probability of observing the two variables in the same
order in two observations with the probability of
observing the two variables in the opposite order.
Specifically, for a sample of $n$ observations
$(X_1, Y_1), \ldots, (X_n, Y_n)$, two observations $(X_i, Y_i)$ and $(X_j, Y_j)$
are called concordant if $(X_i - X_j)(Y_i - Y_j) > 0$ and
discordant if $(X_i - X_j)(Y_i - Y_j) < 0$. Then Kendall's $\tau$
is based on the difference between the numbers of
concordant pairs and discordant pairs.

We introduce a kernel function, 

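$$\phi((X_i, Y_i), (X_j, Y_j)) = \operatorname{sign}\{(X_i - X_j)(Y_i - Y_j)\} = \begin{cases} 1, & \text{if } (X_i - X_j)(Y_i - Y_j) > 0, \\ -1, & \text{if } (X_i - X_j)(Y_i - Y_j) < 0, \\ 0, & \text{if } (X_i - X_j)(Y_i - Y_j) = 0, \end{cases}$$

and define a $U$-statistic

(2.6) $U = \binom{n}{2}^{-1} \sum_{i<j} \phi((X_i, Y_i), (X_j, Y_j)).$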
Then, Kendall's $\tau$ is

$$\frac{U}{\sqrt{\operatorname{Var}_0(U)}},$$

where $\operatorname{Var}_0(U)$ is the variance of $U$ under the null
hypothesis of no correlation between $X$ and $Y$. If $X$ and $Y$ are
continuous variables, the unnormalized sum $\binom{n}{2} U$ has null
variance $n(n-1)(2n+5)/18$ (Hollander and Wolfe, 1999), so that
$\operatorname{Var}_0(U) = 2(2n+5)/[9n(n-1)]$.
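
The computation can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes continuous variables without ties. Because $U$ in (2.6) is divided by the number of pairs, the null variance is used on that normalized scale, $2(2n+5)/[9n(n-1)]$, which is $n(n-1)(2n+5)/18$ divided by the squared number of pairs.

```python
from itertools import combinations
from math import sqrt


def sign(x):
    """Return 1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_u(x, y):
    """U-statistic in (2.6): average of the sign kernel over all pairs."""
    n = len(x)
    total = sum(
        sign((x[i] - x[j]) * (y[i] - y[j]))
        for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)


def null_variance_u(n):
    """Null variance of U for continuous data (no ties assumed)."""
    return 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))


def standardized_u(x, y):
    """U divided by its null standard deviation."""
    return kendall_u(x, y) / sqrt(null_variance_u(len(x)))


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(kendall_u(x, y))       # 8 concordant and 2 discordant pairs
print(standardized_u(x, y))
```

In this toy sample eight of the ten pairs are concordant and two are discordant, so $U = (8 - 2)/10$.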

2.2.2 Generalized Kendall's tau. To test the association
between genetic markers and comorbidity,
Zhang, Liu and Wang (2010) generalized Kendall's
tau as follows. For individuals $i$ and $j$, let $T_i$ and $T_j$
be their vectors of traits, respectively. Then, a trait
kernel is defined as

$$F_{ij} = (f_1(T_i^{(1)} - T_j^{(1)}), \ldots, f_p(T_i^{(p)} - T_j^{(p)}))',$$

where function $f_k(\cdot)$ is the identity function for
a quantitative or binary trait (Rabinowitz, 1997), or
the sign function for an ordinal trait (Zhang, Wang
and Ye, 2006).

Also, recall that, as in Section 2.1.4, $c$ is the number
of copies of any chosen allele in the marker genotype, and
let $C_i$ refer to the $c$ for the $i$th subject. Then,
Zhang, Liu and Wang (2010) defined a marker kernel
as

(2.7) $D_{ij} = C_i - C_j.$

Their $U$-statistic is defined as

(2.8) $U = \binom{n}{2}^{-1} \sum_{i<j} D_{ij} F_{ij}.$

The association test statistic, or generalized Kendall's
tau, is $U' \operatorname{Var}_0^{-1}(U) U$, where $\operatorname{Var}_0(U)$ is the
variance of $U$ under the null hypothesis that there is
no association between marker alleles and any linked
locus that influences the trait $T$. The test statistic
follows an asymptotic $\chi^2$-distribution under the null
hypothesis.
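
The following Python sketch illustrates how the vector-valued $U$-statistic can be assembled from the two kernels. It assumes that the marker kernel is the difference $C_i - C_j$, by analogy with $X_i - X_j$ in Kendall's tau; that form, the function names, and the toy data are assumptions of this sketch. The null variance, which depends on the study design, is not computed here. The second function evaluates an equivalent single-sum form, which holds because both kernels change sign when $i$ and $j$ are swapped.

```python
from itertools import combinations


def sign(x):
    return (x > 0) - (x < 0)


def trait_kernel(t_i, t_j, kinds):
    """F_ij: sign of the difference for ordinal traits, raw difference otherwise."""
    return [
        sign(a - b) if kind == "ordinal" else a - b
        for a, b, kind in zip(t_i, t_j, kinds)
    ]


def generalized_u(c, traits, kinds):
    """Average of D_ij * F_ij over all pairs, with D_ij = c_i - c_j (assumed)."""
    n, p = len(c), len(kinds)
    total = [0.0] * p
    for i, j in combinations(range(n), 2):
        d_ij = c[i] - c[j]
        f_ij = trait_kernel(traits[i], traits[j], kinds)
        for k in range(p):
            total[k] += d_ij * f_ij[k]
    n_pairs = n * (n - 1) / 2
    return [t / n_pairs for t in total]


def generalized_u_single_sum(c, traits, kinds):
    """Same statistic written as (2 / (n - 1)) * sum_i c_i * u_i,
    where u_i is the average of F_ij over j."""
    n, p = len(c), len(kinds)
    total = [0.0] * p
    for i in range(n):
        for j in range(n):
            f_ij = trait_kernel(traits[i], traits[j], kinds)
            for k in range(p):
                total[k] += c[i] * f_ij[k] / n
    return [2.0 * t / (n - 1) for t in total]


# Toy data: allele counts, and (ordinal, quantitative) trait pairs.
c = [0, 1, 2, 1]
traits = [(0, 1.0), (1, 2.5), (2, 3.0), (1, 0.5)]
kinds = ("ordinal", "quantitative")

u = generalized_u(c, traits, kinds)
print(u)
print(generalized_u_single_sum(c, traits, kinds))
```

The single-sum form shows that the statistic is linear in the allele counts, so its null mean and variance follow from the mean and covariance of the allele counts.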

Obviously, the statistic in (2.8) does not incorporate
covariate effects. Adjusting for covariates is relatively
straightforward for a single trait, as was done in (2.4). Here,
however, the traits can be a hybrid of different types. An
alternative is to impose different weights for each pair
of samples in the statistic (2.8) according to the in- 
formation of their covariates. The weight, denoted 
by w{zi,Zj) for the pair (i,j), reflects the relative 
importance attributed by the covariates when we 
derive the statistic. Zhu, Jiang and Zhang (2010) 
examined the following weight function. Write
$z = ((z^{co})', (z^{ca})')'$ with $z^{co} = (z^{(1)}, \ldots, z^{(l)})'$ for the continuous
covariates and $z^{ca} = (z^{(l+1)}, \ldots, z^{(q)})'$ for the
categorical covariates. They defined the weight function
$w(z_i, z_j)$ as

(2.9) $w(z_i, z_j) = W(\|z_i^{co} - z_j^{co}\|) I(z_i^{ca} = z_j^{ca}),$

where $W(\cdot)$ is a positive and decreasing function,
for example, $W(u) = \exp(-u^2/2h^2)$, and $I(\cdot)$ is the
indicator function. Then a weighted test statistic is
given by

(2.10) $S = \binom{n}{2}^{-1} \sum_{i<j} D_{ij} F_{ij} w(z_i, z_j),$

where $D_{ij}$ and $F_{ij}$ are the marker and trait kernels for the pair $(i, j)$.
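
A minimal Python sketch of the pairwise weight and the weighted statistic is given below. It assumes a Gaussian form for $W$ with bandwidth `h`, a covariate tuple whose first `n_continuous` entries are continuous and whose remaining entries are categorical, a single quantitative trait, and a marker kernel equal to the difference in allele counts. All of these choices, and the function names, are assumptions made for illustration.

```python
from itertools import combinations
from math import exp


def pair_weight(z_i, z_j, n_continuous, h=1.0):
    """Gaussian kernel on the continuous covariates times an
    exact-match indicator on the categorical covariates."""
    if z_i[n_continuous:] != z_j[n_continuous:]:
        return 0.0
    dist2 = sum(
        (a - b) ** 2
        for a, b in zip(z_i[:n_continuous], z_j[:n_continuous])
    )
    return exp(-dist2 / (2.0 * h * h))


def weighted_statistic(c, y, z, n_continuous, h=1.0):
    """Weighted pairwise statistic for one quantitative trait,
    with marker kernel c_i - c_j (assumed)."""
    n = len(c)
    total = 0.0
    for i, j in combinations(range(n), 2):
        w_ij = pair_weight(z[i], z[j], n_continuous, h)
        total += (c[i] - c[j]) * (y[i] - y[j]) * w_ij
    return total / (n * (n - 1) / 2)


# Toy data: allele counts, a quantitative trait, and (age, sex) covariates.
c = [0, 1, 2]
y = [1.0, 2.0, 4.0]
z = [(30.0, "F"), (31.0, "F"), (30.0, "M")]

print(pair_weight(z[0], z[1], 1))   # same sex, ages one unit apart
print(pair_weight(z[0], z[2], 1))   # different sex, so the weight is 0
print(weighted_statistic(c, y, z, 1))
```

Pairs that differ in any categorical covariate contribute nothing, and pairs that are far apart in the continuous covariates are down-weighted, so the statistic is driven by comparisons between subjects with similar covariate profiles.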

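Zhang, Liu and Wang (2010) and Zhu, Jiang and
Zhang (2010) showed that under the null hypothesis,
the test statistic $S$ (weighted or not) has the
following asymptotic distribution conditional on all
phenotypes and parental genotypes:

$$\operatorname{Var}_0^{-1/2}(S)[S - E_0(S)] \xrightarrow{d} N(0, I_p),$$

where

$$E_0(S) = \frac{2}{n-1} \sum_{i=1}^{n} u_i E_0(C_i \mid M_i^{pa}),$$

$$\operatorname{Var}_0(S) = \frac{4}{(n-1)^2} \sum_{i=1}^{n} \sum_{j=1}^{n} u_i u_j' \operatorname{Cov}_0(C_i, C_j \mid M_i^{pa}, M_j^{pa}).$$

Here $M_i^{pa}$ denotes the parental marker genotypes of subject $i$, and
$u_i = n^{-1} \sum_{j} F_{ij} w(z_i, z_j)$ collects the trait-kernel terms for
subject $i$ (with $w \equiv 1$ in the unweighted case).
Consequently, the following test statistic

$$\chi^2_{\mathrm{tau}} = [S - E_0(S)]' \operatorname{Var}_0^{-1}(S) [S - E_0(S)]$$

converges to $\chi^2_p$ in distribution under the null
hypothesis provided that $\operatorname{Var}_0(S)$ is full rank. In a
case-control study, we do not have the markers from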
parents and hence the conditional expectations are 
replaced with the unconditional ones. Thus, the key 
difference in the test statistics between family stud- 
ies and population studies lies in the conditioning 
on the parental markers. The conditioning on the 
parental markers gives the family studies a major 
advantage in removing the effect of population ad- 
mixture, but family studies tend to be more difficult 
and expensive to carry out. 

Under the alternative hypothesis, the test statistic
$\chi^2_{\mathrm{tau}}$ can be written as a weighted sum of noncentral
$\chi^2_1$ variables, $\sum_{i=1}^{p} e_i \chi^2_1(\phi_i)$, where $e_1 \ge \cdots \ge e_p$ are the
nonnegative eigenvalues of $\Sigma_1^{1/2} \Sigma_0^{-1} \Sigma_1^{1/2}$, $\phi_i = (\mu_i^*)^2$
and $\mu_i^*$ is the $i$th component of $\mu^* = Q \Sigma_1^{-1/2} \mu$,
where $Q$ is an orthonormal matrix such that
$Q \Sigma_1^{1/2} \Sigma_0^{-1} \Sigma_1^{1/2} Q' = \operatorname{diag}(e_1, \ldots, e_p)$. Here $\Sigma_0$ and $\Sigma_1$
are the covariance matrices of $S$ under the null and alternative
hypotheses, respectively, and $\mu$ is the difference
in the means of $S$ under the alternative and
null hypotheses. Using the approximation theory of
Pearson (1959), Solomon and Stephens (1977) and
Liu, Tang and Zhang (2009), we can find a certain
degree of freedom $l$ and noncentrality parameter $\nu$
such that the distribution of $\chi^2_{\mathrm{tau}}$ can be closely
approximated by $\chi^2_l(\nu)$. Through simulation studies,
Zhu, Jiang and Zhang (2010) confirmed that this approximation is
accurate enough for power calculation.

It is noteworthy that the weight function in (2.9) 
is restrictive with respect to categorical covariates, 
especially so for ordinal covariates. The use of ge- 
nomic propensity score can give rise to an alter- 
native weight function. Specifically, for a di-allelic 
marker G (e.g., SNP), the genomic propensity score 
is the conditional probability $p_g(z) = P(G = g \mid Z = z)$.
This probability can be fitted by a logistic regres- 
sion model or proportional odds model depending 
on whether G is chosen as an allele type or geno- 
type. In the latter choice, the model also depends on 
the mode of inheritance. In the current genomewide 
association studies, we usually only have genotypes 
and cannot distinguish the phases of individual alle- 
les. Thus, we have to construct genomic propensity 
scores by considering various modes of inheritance. 
Once the genomic propensity score is estimated, it 
can be treated as a numerical covariate and then we 
can use (2.9) again. 
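
As a rough illustration of the first case, the Python sketch below fits a logistic regression of a binary allele indicator on a single covariate by Newton's method and returns the fitted propensity score. The one-covariate setting, the fixed number of iterations, the function names, and the toy data are all assumptions of this sketch; in practice one would use an established regression routine and as many covariates as needed.

```python
from math import exp


def expit(t):
    return 1.0 / (1.0 + exp(-t))


def fit_logistic(z, g, n_iter=25):
    """Newton's method for logit P(G = 1 | z) = b0 + b1 * z."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        p = [expit(b0 + b1 * zi) for zi in z]
        # Score vector (gradient of the log-likelihood).
        s0 = sum(gi - pi for gi, pi in zip(g, p))
        s1 = sum((gi - pi) * zi for gi, pi, zi in zip(g, p, z))
        # Information matrix entries.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * zi for wi, zi in zip(w, z))
        h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h00 * h11 - h01 * h01
        # Newton update: add the inverse information times the score.
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1


def propensity_score(z_value, b0, b1):
    """Fitted P(G = 1 | Z = z_value)."""
    return expit(b0 + b1 * z_value)


# Toy data: a binary covariate and an allele indicator.
z = [0, 0, 0, 0, 1, 1, 1, 1]
g = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(z, g)
print(propensity_score(0, b0, b1))
print(propensity_score(1, b0, b1))
```

With a single binary covariate the fitted scores reduce to the observed allele frequencies within each covariate group, which gives a simple check on the fit. The resulting score can then play the role of a continuous covariate in the weight function.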

2.2.3 Examples. Zhang, Liu and Wang (2010) reanalyzed
a data set from the Collaborative Study
on the Genetics of Alcoholism (COGA) (Begleiter 
(1995); Edenberg et al. (2005)). The data came from 
a multi-center (9 sites) consortium that recruited 
study participants by requiring every proband to 
meet two alcohol dependence diagnostic criteria ba- 
sed on DSM-IV-R (American Psychiatric Associa- 
tion, 1994). The first-degree relatives of the probands 
were invited into the study. Zhang, Liu and Wang 
(2010) included a total of 1614 individuals from 143 
families. They considered three phenotypes: (1) al- 
cohol DX-DSM3R + Feighner; (2) maximum num- 
ber of drinks in a 24-hour period; and (3) the re- 
sponse to "spent so much time drinking, had little 
time for anything else." Using the first phenotype 
alone, the p-value of the association between a peak
marker D7S679 on chromosome 7 and the trait was 
0.0019. However, when the three traits were analyzed
together, D7S679 remained the peak marker, and the
p-value was reduced to 0.00055, suggesting that
the other two phenotypes enhanced
the association signal. When the other two phenotypes
were analyzed alone, the analysis did not lead to anything
worthy of further attention.



In the analysis cited above, the association was 
assessed without considering covariates. In a follow- 
up analysis, Zhu, Jiang and Zhang (2010) considered 
two important covariates: age at interview and sex. 
When these two covariates were controlled for, the 
p-value of the association between the peak marker 
D7S679 and the three phenotypes went down further 
to 0.000313. 

3. DISCUSSION 

Studying comorbidity is a significant issue in mental
and behavioral research, dating back a century
(Cannon and Rosanoff, 1911). It is challenging
because of a lack of statistical methods that accommodate
the complexity of comorbidity. While dealing
with comorbidity in genetic studies is the focus
of this review, this goal is reached through the gradual
development and accumulation of methods, with various
challenges dealt with along the way.

Although I focused on the analysis of ordinal traits 
and applications in mental health, the presented me- 
thods are closely related to robust and rank-based 
methods for binary and quantitative traits. Further- 
more, ordinal traits arise in studies of diseases be- 
sides mental illnesses, such as cancer (specifically, 
different stages). 

From the statistical perspective, the methods that 
are presented here have broad applications beyond 
genetic association studies. From college admissions, 
to job searches, to scientific investigations, we make 
inferences based on multidimensional data. It is im- 
portant and imperative to consider and develop in- 
ferential tools for multivariate outcomes, particu- 
larly when the outcomes are discrete. There is ex- 
tensive literature on the statistical analysis of mul- 
tivariate normal variables as well as on nonparamet- 
ric tests for a single variable of nonnormal distribu- 
tion. However, few options are available for the infer- 
ence when we have multiple nonnormally distributed 
variables and potential hybrids of continuous and 
discrete variables. To overcome this challenge, I pre- 
sented several useful statistical techniques such as 
the rank-based $U$-statistics and the kernel-based
weighted statistics to accommodate the mix of con- 
tinuous and discrete outcomes and the presence of 
important covariates. 